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Abstract 

We present a simple, parallel and distributed algorithm for setting up and partitioning a sparse representation of a 
regular discretized simulation domain. This method is scalable for a large number of processes even for complex 
geometries and ensures load balance between the domains, reasonable communication interfaces, and good data 
locality within the domain. Applying this scheme to a list-based lattice Boltzmann flow solver can achieve similar 
or even higher flow solver performance than widely used standard graph partition based tools such as METIS and 
PT-SCOTCH. 
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1. Introduction 



The lattice Boltzmann method (LBM) 111 111 is used in 
computational fluid mechanics (CFD) to simulate in- 
compressible athermal flows. In this work, the two- 
relaxation-time (TRT) collision model 150 is applied on 
a D3Q19 lattice. This means simulation takes place in 
a 3-dimensional space and each node holds 19 particle 
distribution functions (PDFs) and points to 18 surround- 
ing neighbors. 

Domain decomposition of an equidistant Cartesian 
mesh seems to be a trivial task. However, partition- 
ing complex geometries (e.g. porous media) on large 
processor counts is a challenge when the resulting parti- 
tions should contain an equal amount of work and a low 
number of interface cells to neighbor partitions. The 
lack of scalable tools, their high memory consumption 
and long run-times of the partitioner make domain de- 
composition a tedious task. Moreover, most tools only 
return a cell-to-partition mapping but not the order in 
which the cells within a partition should be processed 
later on in the flow solver. 

In this paper we present an approach to address these 
challenges. We start in Sec. 2 with a short overview 
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of different methods for representing the data structures 
of the fluid in the flow solver. Here we concentrate on 
the so called sparse representation. Next in Sec. 3, we 
describe the steps needed to setup this representation 
within our "preprocessor" as well as with METIS and 
PT-SCOTCH (Sec. 4). Finally, in Sec. 6 and 7, perfor- 
mance results for the preprocessor and the flow solver 
are summarized that were obtained from our test bed 
(Sec. 5) and an outlook to future work is given in Sec. 



2. Data structures of the LB flow solver 

Several options exist for representing the data struc- 
tures of the fluid inside the flow solver. Note that this 
representation may be different from the one used for 
partitioning/preprocessing step. 

Simple lattice Boltzmann implementations use the 
marker-and-cell (MAC) method on equidistant Carte- 
sian meshes. Both the flag field used to distinguish fluid 
from obstacle cells and the particle distribution func- 
tions (PDFs) of the lattice Boltzmann method, are stored 
in multi-dimensional arrays lfl4l] . We call this a full ar- 
rays approach since solid parts are just skipped during 
the update step, but still stored in the lattice. 

Particularly in case of regions with a high fraction of 
obstacle cells, it can be beneficial to divide the geomet- 
ric bounding box in sub-blocks ("patches") and to allo- 
cate only those which contain fluid cells. This so-called 



Preprint submitted to Computers &■ Fluids 



November 7, 2011 



patch-based approach has been successfully applied in 
e.g. Hi. 

If the simulation domain contains a high fraction of 
scattered solid matter, hke in porous media, a sparse 
representation can be used where only the PDFs of 
fluid ceUs are stored |[ll|9l[l3,[ll[l3l[il]ina one- 
dimensional array along with an adjacency list. This 
sparse "unstructured" representation requires indirect 
addressing, but adds a lot of flexibility for processing 
the nodes and allows arbitrarily shaped interfaces be- 
tween MPI partitions (i.e., no need for planar bound- 
aries). As a side effect, the bounce back boundary con- 
dition can be implicitly applied and needs no further 
handling. 

In our work, we use the ILBDC (International Lattice 
Boltzmann Development Consortium) code lIlSll . which 
implements a sparse representation of the lattice Boltz- 
mann PDFs. It is designed to efficiently simulate in 
3-D the flow through porous media with a low fraction 
of fluid space. Typical examples are packed bed reac- 
tors where a tube is randomly filled with, e.g., spheres 
or cylinders. Domain decomposition within the flow 
solver is straightforward once the sparse "unstructured" 
representation using the 1-D adjacency list is available 
and independent of the geometric structure of the simu- 
lation domain: When running with MPI processes, 
the 1-D adjacency list traversing all fluid cells is cut 
into equal-sized chunks (some may be smaller by one 
cell). Each process reads its chunk, determines the in- 
terface cells (i.e. cells with direct contact to other par- 
titions), and allocates ghost cells (i.e. "duplicated cells 
from remote") for them. As long as the computational 
work per cell is constant, load imbalance does not oc- 
cur. If the adjacency list was constructed with local- 
ity aspects in mind, also the amount of communication 
(i.e., the surface areas between adjacent partitions) is 
under control. Thus, the challenge is the assembly of 
the sparse representation and its adjacency list. 

3. Distributed assembly of the sparse representation 

The preprocessor assembles the adjacency list out of 
a full representation (fluid cells and obstacles) of the 
simulation domain. At startup, nothing is known about 
the domain except its size. Consequently it must be 
fully allocated to be analyzed. Through a geometric 
decomposition, each partition is assigned to a prepro- 
cessor rank as shown in Fig. [1] (a). Without this par- 
allel setup, domains exceeding a single node's memory 
would require a large shared memory machine. 

In the adjacency list each fluid cell needs a domain- 
wide unique contiguous index Idx, y, z) with Idx, y, z) e 



[1, . . . , A^/], where Nf is the number of the fluid cells in 
the simulation domain. It also holds true Ic{x\,y\,z\) + 
Ic{x2,y2,Z2), when {xi,yi,z\) and {x2,y2,Z2) denote two 
different fluid cells. Solid cells will receive Icix,y,z) - 
0, because they are not considered in the list. 

Several steps are required to finally obtain the 
domain-wide uniq contiguous index: 

Numbering scheme. In the first step each cell receives 
a domain-wide unique but not necessarily contiguous 
index I(x,y,z) by a numbering scheme, which is also 
known as index function. This scheme can be a lexico- 
graphic ordering of the cells with or without blocking, 
a discrete space-filling curve like the Z curve (Morton 
order) jsl], a Hilbert curve or any other positive dis- 
crete injective function f(x,y,z)- Hereby all cells, fluid 
and solid, are considered. After numbering, the fluid 
cell index contains gaps, which correspond to the in- 
dices of the solid cells. During this step the indices / 
and the fluid cell coordinates are stored in the adjacency 
list. 

In- and outcells. In the next step gaps in the index / 
are removed. Each rank determines its incells I and 
outcells O. Incells are cells with domain-wide unique 
index denoted by /(J„) whose previous cell with index 
/(J„) - 1 is located on another partition. Each incell 
I„ corresponds to an outcell 0„. The latter is the first 
cell with l{0„) > I{I„) that does not reside on the same 
partition as J„. Fig. [Tla) shows two incells and outcells 
of rank 1 . 

Contiguous index generation. Each process maintains 
an array of a data structure called incell info or short 
ici. Each ici stores for each incell I„ its index I{I„), 
the index I(On) of the corresponding outcell, the num- 
ber of fluid ceUs Aff (J„) with index I{x, y, z) for which 
/( J„) < I{x, y, z) < I{0„), the MPI rank of the local pro- 
cess, and an entry which will later receive the globally 
contiguous index of the first fluid ceU in the range of 
this ici. 

The partitions are considered as the leaves of an (in- 
complete) octree. From bottom to top on each level I 
of the tree a "sibling master" is selected among each 
group of up to eight children. This master receives its 
siblings' ici arrays and stores them in a local array A/. 
In a new array Bi a copy of A/ is stored and sorted by 
the domain-wide unique index I{I„) of the icis. For 
all icis the rank is set to the sibling master's one. Two 
icis in array Bi are merged if the outcell index /((9,) of 
ici Bi{i) equals the incell index /(J,+i) of the following 
ici B;(/-i- 1). These two elements are replaced by a new 
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Figure 1: The simulation domain is geometrically decomposed by the preprocessor and partitions are assigned to processes (a). The preprocessor 
sets up the adjacency list with a domain-wide unique contiguous index (b). This list is picked up by the flow solver and cut into equally sized 
chunks so that each process has the same number of fluid cells (c). 



one with the incell index /(J,) of Bi{i) and the outcell 
index of + 1). The number of fluid cefls as- 

sociated with Bi{i) and + I) are summed up. This 
procedure is repeated until the top level is reached, i.e., 
when no siblings are left. 

The top level process computes the domain-wide 
unique contiguous indices for all icis in its Bi array. 
The index 7^ starts at 1 for the first entry and is in- 
cremented for each further entry by the number of fluid 
cells Ns(I„-\) of the previous entry Z?/(« - 1). These 
new indices can be mapped back for each ici to the 
icis in A/. Array A/ is sorted by the rank, and subse- 
quently contiguous chunks of entries are sent back to 
their origin ranks. 

The receiving rank overwrites its Bi array with the 
received data and repeats the steps of mapping back in- 
dices and sending back icis until all leaves have been 
reached. 

Each process can now assign a domain-wide unique 
contiguous index 7^. to the fluid cells in the adjacency 
list and in the full representation. The missing adja- 
cency information can be obtained by inspecting the 
neighborhood of each fluid cell in the full representa- 
tion. Since the neighborhood may contain cells outside 
the local partition, halo layers are needed. 



Storing the adjacency information. Prior to writing the 
adjacency list (Fig. [Tib)) to disc it is sorted in parallel 
according to the unique contiguous index 7^ . Each pro- 
cess then writes an equally sized chunk of the list. This 
results in a better lO performance than letting each pro- 
cess write scattered parts. The information stored in the 
adjacency file is later used by the solver (Fig. [He)). 



4. Preprocessing with graph partitioning 

The flow solver picks up the adjacency list for simu- 
lation and cuts it into equally sized chunks as shown in 
Fig. [lie). No respect is taken to the shape of the result- 
ing partitions. Graph partitioning algorithms could gen- 
erate more well-formed partitions. Therefore the widely 
used graph partitioner METIS fl and PT-SCOTCH ^ 
were employed for comparison. 

The full representation of the simulation domain is 
transformed into a graph. Each fluid node is considered 
as a vertex. PDFs represent edges between the fluid cells 
they origin from and the fluid node they are pointing to. 
Utilizing all PDFs, i.e. the complete neighborhood of 
a fluid cell, is called the "full neighborhood" (FN) or 
Moore neighborhood. It is possible to reduce the num- 
ber of edges of the graph by only using PDFs along the 
main coordinate axes, which we name "reduced neigh- 
borhood" (RN) or von Neumann neighborhood. 

To avoid confusion with the lattice partition that is 
owned by a preprocessor, we use the term "graph parti- 
tion" for the partition assignment generated by a graph 
partitioner. 

METIS. For METIS (version 4.0.3, k-way, FN) the 
graph is built serially by the preprocessor and given to 
METIS for partitioning. Subsequently the preprocessor 
determines the number of fluid cells belonging to each 
graph partition and computes the domain-wide unique 
contiguous start index 7^ of each graph partition. 

It is iterated in a lexicographic order (with blocking) 
over the cells of the full representation. The graph par- 
tition a fluid cell belongs to is determined and the next 
free index 7^ inside this graph partition is assigned. Af- 
ter all cells have been processed separate files contain- 
ing the sparse representation for each graph partition are 
written to disc and picked up later by the flow solver. 
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PT-SCOTCH. With PT-SCOTCH (version 5.1.12, RN 
and FN) partitioning is performed in parallel. The graph 
has to be setup in a parallel and distributed manner by 
the preprocessor. After partitioning, as with METIS, the 
number of cells of each graph partition has to be deter- 
mined and domain-wide unique contiguous start indices 
must be computed. 

A global lexicographic ordering (with blocking) over 
all cells in the simulation domain is used to visit each 
cell in the local partition. Index assignment is thereby 
performed in the same way as with METIS. Graph par- 
titions can spread across more than one process. In this 
case each process has to determine how many cells of 
which graph partition are located on its subdomain. The 
domain-wide unique contiguous index Ic is assigned to 
fluid cells of such graph partitions in contiguous blocks 
for each process. 

The processes of the flow solver would later 
pick up the generated sparse representation and cut 
it into equally sized chunks. But the generated 
graph partitions from PT-SCOTCH might not all 
have the same number of cells and therefore would 
not correspond with the partitions gained by sim- 
ple cutting. The sparse representation therefore 
is extended by a list of the domain-wide unique 
contiguous start indices of each graph partition 
for recovering them at the initialization of the solver. 

Note: The goal here is not to compare METIS and PT- 
SCOTCH. Results for these graph partitioners should 
only give an estimate of the performance that can be 
achieved with these tools, since graph partitioning has 
the disadvantage that the processing order of the cells is 
not defined, which can have enormous impact on per- 
formance. 

Further the behavior of PT-SCOTCH can be influ- 
enced by complex strategy strings specifying when and 
which partitioning method should be applied. Here just 
the default strategy string was used, which according to 
should yield good results in most cases. Also notable 
is that PT-SCOTCH produces different partitions when 
different numbers of processes for the same geometry 
and the same number of partitions are used. 

For the usage of graph partitioning the number of 
flow solver processes used for CFD simulation later 
must already be known at preprocessing time, as for this 
number of processes partitions are generated. This re- 
striction does not apply for our method. 



5. Test bed 

An overview of the two systems used as test bed is 
given in Tab. [T] The first system "Westmere" is a cluster 
with fully nonblocking QDR InfiniBand (IB) intercon- 
nect. The four-socket "MagnyCours" system is a large 
shared memory machine. 





Westmere 


MagnyCours 


Processor 


Intel Xeon 


AMD 




X5650 


Opteron 6176 


Freq. [GHz] 


2.67 


2.30 


max. Turbo [GHz] 


3.06 




Cores 


6 


2x6 


LI [kB] 


32 


64 


L2 [kB] 


256 


512 


L3 [MB] 


12 


5 


Sockets 


2 


4 


NUMA domains 


2 


8 



Table 1 : Details of the two systems Westmere and MangyCours used 
for preprocessing geometries and performance evaluation of the flow 
solver. The Westmere system is a cluster with QDR InfiniBand as 
interconnect. 



6. Resource requirements for preprocessing 

The preparation of the simulation domain ("pre- 
processor") and the flow simulation are separated in 
ILBDC. In the first step the simulation is prepared, i.e. 
the sparse representation is extracted out of the full rep- 
resentation of the domain. In the second step the flow 
simulation takes place. 

This section examines the resource requirements like 
run time and memory usage of the preprocessing. The 
next section then evaluates the performance of the CFD 
simulation itself. 

An empty channel (called channel) and a tube filled 
with a sphere packing (called packing) as in Fig. |2] were 
chosen as benchmark geometries. The empty channel 
exhibits nearly only fluid cells in contrast to the packing 
with a high fraction of solid cells and a complex struc- 
ture. Both geometries can be scaled to any diameter d 
with dimensions of (5 * d)xd xd cells. 

We use geometries with diameters d = 500, 800, and 
1000. The duration of the preprocessing depends on the 
dimensions and the ratio of fluid cells of the simulation 
domain. The "larger" a geometry becomes the more ob- 
servable the runtime becomes, which ranges from min- 
utes to hours. 
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(a) Slice in the xy plane. (b) Slice in the yz 

plane. 

Figure 2: Slices of a packed bed reactor of spheres. Blue indicates fluid cells and red solid cells. 



method 




nodes 


PPN 


memory [GB] 

max/proc. total 


duration [s] 

w/o lO w/ lO 


lO [GB] 


channel, d — 


800, 


«2,5* 


10'^ fluid cells 












lex. ordering 


1 


128 


1 




7.1 


908 


84 


883 


371 


lex. ordering 


100 


128 


1 




7.1 


907 


53 


834 


371 


PT-SCOTCH 




128 


1 




10.8 


1,298 


800 


3901 


371 


lex. ordering 


1 


256 


12 




2.4 


1,469 


57 


965 


371 


lex. ordering 


100 


256 


12 




2.4 


1,003 


14 


1479 


371 


packing, d — 


800, 


=^ 1,1 * 


lO'^ fluid 


cells 












lex. ordering 


1 


128 


1 




4.1 


413 


65 


382 


163 


lex. ordering 


100 


128 


1 




4.1 


411 


38 


345 


163 


PT-SCOTCH 




128 


1 




8.0 


638 


542 


2092 


163 


lex. ordering 


1 


256 


12 




2.1 


1,130 


66 


475 


163 


lex. ordering 


100 


256 


12 




2.1 


506 


22 


433 


163 



Table 2: Resource requirements for preprocessing an empty channel (« 2, 5 * lO' fluid cells) and a packed bed of spheres (» 1,1* lO' fluid cells) 
with dimensions of 4000 X 800 X 800 cells each on 128 (1 PPN) and 256 (12 PPN) compute nodes of the Westmere cluster with lexicographic 
ordering and PT-SCOTCH. With PT-SCOTCH preprocessing was neither possible on 128 nodes with 12 PPN nor on 256 nodes with 1/12 PPN. 
Runtime is dominated by disc lO to the parallel file system (Lustre), which achieved between 100 and 550 MB/s depending on the load on the file 
system. 



As an example we selected channel and packing with 
dimensions of 4000 x 800 x 800 cells each, which are 
» 2, 5 * 10^ and 1, 1 * 10** fluid cells, respectively. 

Table|2]shows details for preprocessing these two ge- 
ometries with our method (lexicographic ordering with 
blocking factor 1 and 100) and with PT-SCOTCH. Ob- 
viously the runtime is dominated by lO, i.e. writing the 
sparse representation to disc. The time for index gener- 
ation (w/o lO) here becomes negligible. Lexicographic 
ordering and PT-SCOTCH show heavily different dura- 
tions. Despite the difference in non-IO durations, the 
main cause is the varying lO bandwidth. The underly- 
ing parallel file system (Lustre) on the Westmere cluster 
is designed for an aggregated lO bandwidth of 3 GB/s 
but the bandwidth a particular job actually can see heav- 
ily depends on the load generated by other users and 
their applications. Typically, we only could sustain 100 



- 500 MB/s with our application in non-dedicated clus- 
ter operation. 

Comparing the memory usage when using 128 nodes 
and 1 process per node (PPN) for preprocessing re- 
veals diff'erences between lexicographic ordering and 
the usage of the graph partitioner. Fig. |3] shows a 
recorded memory usage trace (total memory usage over 
all used nodes) during preprocessing the packing geom- 
etry (d - 800) with lexicographic ordering (blocking 
factor 1 and 100) and PT-SCOTCH. 

PT-SCOTCH was still partitioning after 12 hours with 
128 nodes and 12 PPN nor with 256 nodes and 1 or 
12 PPN. Comparing lexicographic ordering for 128 and 
256 nodes shows that with a reasonable blocking factor 
like 100 the non-IO time decreases with more nodes. 
With more processes the total amount of memory used 
increases, because each process requires its own work- 
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Figure 3: Memory usage traces for preprocessing a packed bed of 
spheres with dimensions 4000 X 800 X 800 cells x: 1,1*10' fluid ceUs 
with lexicographic ordering and PT-SCOTCH. 



ing set. Also with blocking factor 1 larger interfaces 
are generated than with blocking factor 100. Therefore 
larger data structures are required to describe this in- 
formation which becomes significant as the number of 
processes increases. This explains the huge difference 
of total memory usage for lexicographic ordering with 
blocking factor 1 and 100 in Tab. |2] 

For PT-SCOTCH it seems favorable to use as few 
processes as possible for partitioning. With, e.g. 256 
nodes and 1 or 12 PPN (256 or 3072 processes), parti- 
tioning geometries with diameters > 500 did not finish 
after 12 hours. As an alternative we used 128 nodes to 
have enough total memory available but use only 1 PPN 
to reduce the number of processes. The reduced neigh- 
borhood was used to reduce the required memory for 
representing the graph. 

7. Performance of CFD simulation 

The benchmarked implementation of the flow solver 
uses double precision floating-point for representing the 
PDFs, a structure-of -arrays data layout, pull scheme 
1 1411 . and nontemporal stores for writing the new PDFs 
to the destination grid 11511 . One solver process was 
bound to each physical core, i.e., twelve MPI-processes 
per node on Westmere and 48 on MagnyCours were 
used. As performance metric we use fluid lattice up- 
dates per second (FLUP/s) which only takes the updated 
fluid cells into account. One lattice update involves ap- 
prox. 200 floating point operations (FLOPs) in our im- 
plementation. This number was extracted by inspecting 
the generated assembler source of the LB kernel. 

For evaluating the single node performance and 



strong scaling we used the channel and packing geome- 
tries with dimensions of 500 x 100 x 100 cells, which 
contain ^ 4, 8 * 10^ and 2, 1 * 10^ fluid cells, respec- 
tively. 

Single node performance. The performance of ILBDC 
depends heavily on the chosen numbering scheme used 
for setting up the adjacency list. Figure |4] shows that 
the performance varies by about 20% depending on the 
blocking factor (black curves). A blocking factor of one 
is equivalent to no blocking at all. 

Comparing data in Fig. |4(a)| and |4(b)| reveals that the 
geometry has no significant impact on performance, al- 
though the adjacency of cells and the number of fluid 
cells is diff'erent. The implementation of the discrete 
space-filling Z curve uses bitwise interleaving of cell 
coordinates (1 bit) or interleaves two-bit groups of the 
coordinates (2 bit), which in general results in better 
performance. METIS and PT-SCOTCH give a compa- 
rable performance which is always in the range of the 
maximum performance achieved with lexicographic or- 
dering. 

Strong scaling. Using more than eight nodes of the 
Westmere cluster the two benchmark geometries show 
nearly the same performance characteristics, albeit on 
different levels (Fig. |5(a)| i. This may be explained by 
the different number of fluid cells. It also reveals that 
choosing the right blocking factor for lexicographic or- 
dering has more influence on performance than on a sin- 
gle node. The data in Fig. |5]shows a 50% performance 
fluctuation on 64 nodes depending on the blocking fac- 
tor. With 128 nodes no more performance is gained 
since the number of cells each process owns is as low as 
X 1400 fluid cells. Utilizing even more nodes is coun- 
terproductive. 

Lexicographic ordering with blocking outperforms 
the Z curve (1 and 2 bit) as well as partitioning with 
METIS as shown in Fig. [5(b)] With METIS an 
early performance decrease of one third compared to 
other methods is observable at already 128 nodes. PT- 
SCOTCH generates excellent results for this geometry, 
in some cases even a slightly better performance than 
lexicographic ordering with blocking. 

Large geometries. For the evaluation of the perfor- 
mance of "larger " geometries we used 256 nodes of 
the Westmere cluster with 12 PPN resulting in 3072 
processes. Figure |6] shows the performance achieved 
with lexicographic ordering (blocking factor 1 and 100) 
and partitioning with PT-SCOTCH for the benchmark 
geometries channel and packing with diameters d — 
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Figure 4: Pull scheme with different numbering schemes: lexico- 
graphic ordering with blocking (black curves), Z curve (1 and 2 bit, 
purple and green curves), and partitioned with METIS (orange curves) 
on one node of Westmere and MagnyCours (twelve and 48 MPl pro- 
cesses, respectively). 



: 8, proc.= % 
I n = 16, proc. = 192 



>n = 32, proc. = 384 
n = 64, proc. = 768 
: 128, proc. = 1736 




30 40 50 60 
blocking factor 

(a) 



90 100 



4000 



3000 



u 2000 



"E 1000 



lex. ordering, blk. = 1 
lex. ordering, blk. = 50 
lex. ordering, best blk. 
Z curve, 1 bit 
Z curve, 2 bit 
Metis, k-way 
PT-Scotch, FN 




16 32 
nodes 

(b) 



64 



128 



Figure 5: Strong scaling on 8 - 256 Westmere nodes (12 PPN) for a 
packed bed of spheres (packing) with dimensions of 500 X 100 X 100 
cells a 2, 1 * 10^ cells for lexicographic ordering with blocking |( a) | 
and compared to Z curve 1 and 2 bit, and METIS [(bj] Dramatic dif- 
ferences in performance make the light choice of the blocking factor 
for adjacency list setup cmcial. 



500, 800, and 1000 cells. The dimensions of both ge- 
ometries are {d *5)x dxd cells. 

Lexicographic ordering (blocking factor 100) and 
PT-SCOTCH achieve nearly the same results except 
for packing d = 800. Here the performance of PT- 
SCOTCH decreases significantly. The cause might be 
the complex geometry in combination with the reduced 
neighborhood which was used for setting up the graph. 

The importance of the right blocking factor for lex- 
icographic ordering can be seen for both geometries. 
Over 200 % performance is gained by switching from 
blocking factor 1 to 100. This can be explained by ana- 
lyzing the generated partitions. 

Therefore we have analyzed the number of neighbor 
partitions and the number of remote links per partition. 
The latter one is the number of PDFs of adjacent parti- 



tions that must be communicated. 

With a blocking factor of 1 the domain is cut into 
(partial) slices. This causes a low count of neighbor 
partitions (Fig. |7(a)| left panel), but also increases inter- 
partition communication (Fig. |7(b)| left panel). For a 
larger blocking factor, e.g. 100, the number of neighbor 
partitions is increased (Fig. |7(a)| middle panel), but the 
number of links to communicate decreases (Fig. |7(b)| 
middle panel). 

Comparing the partitions gained by lexicographic or- 
dering with a blocking factor of 100 and PT-SCOTCH 
(Fig. |7(a)| and |7(b)| middle and right panels) more par- 
titions with a significantly increased number neighbor 
partitions (50 - 150) were generated. 

PT-SCOTCH generated for a packed bed of spheres 
with dimensions 5000 x 1000 x 1000 cells a graph par- 
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Figure 6: Performance of the flow solver on 256 Westmere nodes 
(12 PPN) for the channel and packing benchmark geometries pre- 
processed with lexicographic ordering (blocking factor 1 and 100) 
and partitioning with PT-SCOTCH. The geometries' dimensions are 
{5 * d)x d X d cells, where d denotes diameter of the geometry. For 
channel d = 1000 we were not able to generate a graph partitioning 
with PT-SCOTCH. 



tition (Fig. [8]) that had a lot of unconnected cells spread 
over nearly the complete simulation domain. Due to a 
restriction in the flow solver we are not able to use parti- 
tions for which the product of the boudning box dimen- 
sions exceeds 2^' cells. 
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(a) Number of neighbor partitions. Axis labels of the inset are the 
same as of the surrounding graph one's. 
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8. Conclusion and future work 

We presented an algorithm to preprocess a geometry 
for flow simulation that can compete with established 
graph partitioning tools. It provides the advantage that 
the number of flow solver processes does not need not 
to be known at preprocessing time. This number can 
also be freely chosen at every restart of the simulation. 
Independently of the number of processes used for pre- 
processing a specific geometry and a chosen numbering 
scheme always produce exactly the same sparse repre- 
sentation. 

Choosing the right blocking factor for lexicographic 
ordering is crucial for good performance of simulations 
based on the sparse representation of the simulation do- 
main. Unfortunately the best choice depends on the ma- 
chine architecture, the number of fluid nodes, the adja- 
cency of the nodes, the number of processes, and the 
communication network. 

We are working on a heuristic approach that arrives at 
reasonable predictions for the best blocking factor. Also 
we are investigating an auto-tuning approach which de- 
termines the right parameter set for a given geometry 
and architecture by only simulating samples of the ge- 
ometry. 



Figure 7: Histograms over the number of neighbor partitions ! (a) | and 
over the number of remote links per partition |(b)| i.e. the number of 
adjacent fluid cells' PDFs belonging to a dift'erent partition. 
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